-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
download compression is too slow #727
Conversation
and reduce logging noise from data version checker
Tests on a medium-sized data set (3552 sequences): I measured request times of /sample/unalignedNucleotideSequences: - uncompressed: 1300 ms, 65 MB - gzip compressed: 3100 ms (before: 14 s), 8 MB - zstd compressed: 970 ms (before: 7.4 s), 342 kB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool! What was the issue before? (I don't really understand what the changed code is doing.)
override fun write(bytes: ByteArray) { | ||
compressingStream.write(bytes) | ||
} | ||
|
||
override fun write( | ||
bytes: ByteArray, | ||
offset: Int, | ||
length: Int, | ||
) { | ||
compressingStream.write(bytes, offset, length) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chaoran-chen Those are the relevant changes. We didn't forward those methods to the compressing stream. Instead, the default implementation would call fun write(byte: Int)
for every entry of the ByteArray
. Apparently the compression streams can do a lot better than writing every byte individually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (works on my machine :) )
resolves #717
Tests on a medium-sized data set (3552 sequences):
I measured request times of /sample/unalignedNucleotideSequences:
I tested this using the Ebola dataset from Loculus. I pushed it to a branch: https://github.com/GenSpectrum/LAPIS/tree/ebolaTestData
PR Checklist
[ ] All necessary documentation has been adapted.[ ] The implemented feature is covered by an appropriate test.